Partitioned Narratives: Thick Mapping the 1947 Partition Archive

Introduction

The documentation below is the white paper for the essay “Partitioned Narratives: Thick Mapping the 1947 Partition Archive.” It includes the R code and CSV files necessary to reproduce the calculations. Some of the spatial manipulations were performed in QGIS 3.16 (Hannover). Where possible, images and R code chunks have been provided for reproducibility. Some steps involved converting CSV files to GeoPackage files; as this is a common GIS workflow, it has been skipped.

Part 1: Priming the NER Extracted Data

Load packages

The following packages are necessary to run this script: tidyverse, tidygeocoder, tidytext, stringi, htmlTable, ggplot2, scales, and broom.

library(tidyverse)
library(tidygeocoder)
library(tidytext)
library(stringi)
library(htmlTable)
library(ggplot2)
library(scales)
#broom supplies the tidy() methods used for the t-test and ANOVA results
library(broom)

Load location data

The loaded CSV file is a cleaned-up version of the one that results from scraping the data and running it through named entity recognition (NER). The cleaning process mostly involves removing false positives, consolidating equivalent locations (e.g. Bombay and Mumbai), and removing any corrupt data. This process also included coding the gender of each narrator and whether a person mentioned their occupation. Finally, for lesser-known or ambiguous locations we added the city and district to aid the geocoder.

partition_df <- read_csv("data/post_clean_locations.csv", na = c("", "NA"))

Reformat partition_df

The following procedure primes the data for analysis:

  • An address field is created by uniting the location, city, and country field
  • Remove unnecessary strings from data fields
  • Drop unnecessary columns
  • Keep all distinct addresses by person name. This prevents double counting locations in a person’s account
partition_distinct_locations <-  partition_df %>%
  group_by(name) %>%
  #create address field from locations, city, and country columns
  unite("address",
        locations:country,
        sep = ", ",
        na.rm = TRUE) %>%
  #remove label prefixes from the age and migration fields
  mutate(
    age = str_remove(age, "Age in 1947: "),
    migrated_from = str_remove(migrated_from, "Migrated from: "),
    migrated_to = str_remove(migrated_to, "Migrated to: "),
  ) %>%
  #drop unnecessary columns
  select(name:migrated_to, gender:address) %>%
  #keep all distinct addresses. This helps reduce the query time for the geocoder.
  distinct(address, .keep_all = TRUE) %>%
  ungroup()
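
To make the effect of these steps concrete, here is a minimal sketch on invented data. The table toy_df and its values are illustrative assumptions, not rows from the archive; only the column names mirror the ones used above. It shows that na.rm = TRUE keeps missing parts out of the address, and that the same address is kept once per person rather than once overall.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for partition_df; all values are invented for illustration
toy_df <- tibble(
  name      = c("A", "A", "A", "B"),
  locations = c("Anarkali Bazaar", "Anarkali Bazaar", "Lahore", "Lahore"),
  city      = c("Lahore", "Lahore", NA, NA),
  country   = c("Pakistan", "Pakistan", "Pakistan", "Pakistan")
)

toy_distinct <- toy_df %>%
  group_by(name) %>%
  # na.rm = TRUE skips missing parts instead of pasting "NA" into the address
  unite("address", locations:country, sep = ", ", na.rm = TRUE) %>%
  # within each person, keep one row per address
  distinct(address, .keep_all = TRUE) %>%
  ungroup()

toy_distinct
```

Person A's repeated mention of Anarkali Bazaar collapses to a single row, while "Lahore, Pakistan" survives for both A and B because the grouping is by person.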

Part 2: Geocoding

Find distinct addresses

Geocoding the addresses can be quite time consuming. To save time, we can run only the distinct addresses and then join these back to partition_distinct_locations afterwards.

#create a table of distinct addresses
addresses <- partition_distinct_locations %>%
  distinct(address)

Run geocoder

The script relies on the tidygeocoder package developed by Jesse Cambon, Diego Hernangómez, Christopher Belanger, and Daniel Possenriede. The package allows users to select the geocoder of their choice. For ease of reproducibility, OpenStreetMap (osm) was selected, though other services that require registration or login might be more accurate. Because geocoding is time consuming, the call has been commented out and the geocoded address file has been cached.

#Because the processing time is quite lengthy, the geocoding call has been commented out. Remove the comment when running custom data.

#addresses_geocoded <- geo(addresses$address, method = 'osm', full_results = FALSE)
#Skip geocoding and read in the cached geocoded addresses
addresses_geocoded <- read_csv("data/addresses_geocoded.csv")

Join coordinates to partition_distinct_locations

The coordinates are joined to the existing dataframe partition_distinct_locations.

partition_geolocations <- partition_distinct_locations %>%
  left_join(addresses_geocoded, by = "address")

Clean final table

The geocoder will not necessarily catch all locations. Some of the locations have to be geocoded and corrected manually. This process is involved, and has to be done through QGIS. Several additional fields were created to keep track of the changes:

  • known - Whether the location was ultimately found. FALSE indicates that the location is a best guess

  • camp - Indicates that this was a refugee camp. This data was not used

  • resolved_location - The final location name for the coordinate. There may be a discrepancy between this and the initial address

  • admin - Indicates whether this is a larger administrative area within which other locations fall. It also includes rivers. Admin areas are dropped from analysis because they are redundant. Likewise, as the position along a river is often unknowable, rivers were dropped too.
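
To see which rows still need manual attention in QGIS, one option is to list the addresses for which the geocoder returned no coordinates. The sketch below uses invented data; toy_geo and needs_review are hypothetical names, and the coordinate columns are assumed to be called lat and long, as in tidygeocoder's default output.

```r
library(dplyr)

# Toy stand-in for partition_geolocations; names and coordinates are invented
toy_geo <- tibble(
  name    = c("A", "A", "B"),
  address = c("Lahore, Pakistan", "Unknown Village", "Delhi, India"),
  lat     = c(31.55, NA, 28.65),
  long    = c(74.34, NA, 77.22)
)

# Rows with a missing coordinate are candidates for manual geocoding
needs_review <- toy_geo %>%
  filter(is.na(lat) | is.na(long)) %>%
  distinct(address)

needs_review
```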

write_csv(partition_geolocations,
          "data/partition_geolocations_raw.csv")

Part 3: Statistical Overview

Import clean data

Read in the data file partition_geolocations_clean. This file is read-only to prevent accidental file corruption.

partition_clean <-
  read_csv("data/partition_geolocations_clean.csv", na = "NA")

Aggregate location totals

Generate a table for all analysis: only include non-administrative areas, unique locations for each person, counts per person, and total counts per location.

partition_statistics <- partition_clean %>%
  rename(latitude = 9) %>%
  rename(longitude = 10) %>%
  filter(admin == FALSE) %>%
  filter(occupation != "No") %>%
  mutate(PersonID = paste(name, "_", age)) %>%
  group_by(PersonID) %>%
  distinct() %>%
  add_count(PersonID, name = "loc_by_name") %>%
  ungroup() %>%
  add_count(resolved_location, name = "loc_total")

General Overview

#Get number of unique locations

unique_locations <- partition_statistics %>%
  ungroup() %>%
  select(resolved_location) %>%
  distinct() %>%
  nrow()

#Get number of unique people
unique_people <- partition_statistics %>%
  ungroup() %>%
  select(PersonID) %>%
  distinct() %>%
  nrow()

#Calculate mean locations mentioned
mean_locations <- partition_statistics %>%
  ungroup() %>%
  summarize(mean_locations = mean(loc_by_name))

There are 768 unique locations in the data set. These are distributed across 320 people. On average, each person mentions 9.49 locations.

Locations by gender

Broken down by gender, the mean number of locations mentioned by men is higher than that mentioned by women.

mean_locations_gender <- partition_statistics %>%
  group_by(gender) %>%
  summarize(mean_gender = round(mean(loc_by_name), 2))

mean_locations_gender %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>% 
  htmlTable(header = c("Gender", "Mean Locations Mentioned"))
   Gender   Mean Locations Mentioned
1  Female   8.67
2  Male     9.88

Locations by occupation

A similar trend emerges when accounting for occupation. Here, people who mention their occupation mention more locations.

mean_locations_occupation <- partition_statistics %>%
  group_by(occupation) %>%
  summarize(mean_occupation = round(mean(loc_by_name), 2))

mean_locations_occupation %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>% 
  htmlTable(header = c("Occupation", "Mean Locations Mentioned"))
   Occupation      Mean Locations Mentioned
1  Not Mentioned   8.06
2  Yes             9.92

Locations by occupation and gender

The contrast in the number of locations mentioned becomes even starker when the values are disaggregated by both gender and occupation.

mean_locations_occ_gen <- partition_statistics %>%
  group_by(gender, occupation) %>%
  summarize(mean_location = round(mean(loc_by_name), 2))
mean_locations_occ_gen %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>%
  htmlTable(header = c("Gender", "Occupation", "Mean Locations Mentioned"))
   Gender   Occupation      Mean Locations Mentioned
1  Female   Not Mentioned   8.24
2  Female   Yes             9.28
3  Male     Not Mentioned   7.19
4  Male     Yes             10.05

Percent mention of occupation by gender

Generally, men mentioned their occupations far more often than women (92% versus 38%).

partition_statistics %>% 
  group_by(gender) %>%
  select(PersonID, gender, occupation) %>% 
  distinct() %>% 
  count(occupation) %>% 
  mutate(percent = paste(round(n/sum(n),2)*100,"%")) %>% 
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>% 
  htmlTable(header = c("Gender", "Occupation", "Number of People", "Percent" ))
   Gender   Occupation      Number of People   Percent
1  Female   Not Mentioned   70                 62 %
2  Female   Yes             43                 38 %
3  Male     Not Mentioned   17                 8 %
4  Male     Yes             192                92 %

Distribution of locations mentioned

The distribution of locations mentioned shows that men who do not mention an occupation have a negligible impact on the mean number of locations mentioned. Meanwhile, the number of women who do not mention an occupation is quite substantial, and they do tend to mention fewer locations. Even among those who mention their occupation, the men’s distribution has a longer tail.

partition_statistics %>%
  distinct(PersonID, gender, occupation, loc_by_name) %>%
  ggplot(aes(loc_by_name, fill = gender)) +
  geom_histogram(
    color = "black",
    alpha = .4,
    position = "identity"
  ) +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Histogram of Mentioned Locations by Occupation and Gender",
       x = "Number of Locations Mentioned",
       y = "Number of People",
       fill = "Gender") +
  facet_wrap(~ occupation) +
  theme_classic()

Figure 1: Locations Mentioned by Occupation and Gender

T-test and ANOVA score

#Generate t-scores

gender_ttest <- t.test(loc_by_name ~ gender, partition_statistics)
occupation_ttest <- t.test(loc_by_name ~ occupation, partition_statistics)

#Create dataframes of tscores
tscores <- map_df(list(gender_ttest, occupation_ttest), tidy)
tscores <- tscores[c("p.value")]

#Generate variables for ANOVA
gender_occupation <- partition_statistics %>%
  unite("gender_occupation", gender:occupation, remove = FALSE)
anova_gender_occupation <-
  aov(loc_by_name ~ gender_occupation, gender_occupation)

#Create dataframe for ANOVA score
anovascore <- map_df(list(anova_gender_occupation), tidy)
anovascore <- anovascore[c("statistic", "p.value")]

Welch two-sample t-tests on the difference in mean locations by gender (p = 6.1e-16) and by occupation (p = 8.2e-34) affirm what visual inspection already suggests: that the differences in means are unlikely to be due to chance. At the same time, an analysis of variance (ANOVA) yields an F-score of 44.42 and a p value of 5.7e-28, indicating that the variance between the group means is greater than the variance within the groups and is likewise unlikely to be random.
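
For readers who want to see where these statistics come from without the tidy() helper, the following is a minimal base R sketch on invented counts. The data frame toy and its values are assumptions made purely for illustration; they are not the project's data and will not reproduce the figures reported above.

```r
# Invented counts of locations mentioned; not the project's data
toy <- data.frame(
  loc_by_name = c(4, 6, 5, 7, 9, 11, 10, 12),
  gender      = rep(c("Female", "Male"), each = 4),
  occupation  = rep(c("Not Mentioned", "Yes"), times = 4)
)

# t.test() defaults to the Welch test (var.equal = FALSE)
welch <- t.test(loc_by_name ~ gender, data = toy)
welch$p.value

# One-way ANOVA on the combined gender/occupation groups
toy$gender_occupation <- paste(toy$gender, toy$occupation, sep = "_")
fit <- aov(loc_by_name ~ gender_occupation, data = toy)
anova_table <- summary(fit)[[1]]
f_value <- anova_table[["F value"]][1]   # F statistic
p_value <- anova_table[["Pr(>F)"]][1]    # p value
c(F = f_value, p = p_value)
```

The F statistic is the ratio of the between-group mean square to the within-group mean square, which is why a large value points to differences between the groups rather than noise within them.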

Part 4: Spatial Analysis

The spatial analysis of the data set was done with QGIS. As these manipulations are hard to document, only their results are shown. There were a number of cases where tidygeocoder did not catch all of the locations; these had to be added manually.

Location diversity

Departure locations

#Departure locations
part_from <- partition_statistics %>%
  mutate(migrated_from = str_extract(migrated_from, "[^,]+")) %>%
  drop_na(migrated_from) %>%
  select(PersonID, migrated_from, gender)  %>%
  distinct(PersonID, migrated_from, gender) %>%
  add_count(migrated_from, name = "total_location") %>%
  group_by(gender) %>%
  add_count(migrated_from, name = "loc_gender") %>%
  add_count(gender, name = "gender_tot") %>%
  mutate(percent = loc_gender / gender_tot) %>%
  select(-PersonID,-loc_gender,-gender_tot) %>%
  distinct() %>%
  #ungroup() %>%
  arrange(desc(total_location), gender) %>%
  top_n(5, total_location) %>%
  mutate(percent = percent(percent, accuracy = 2)) %>% #accuracy = 2 rounds to the nearest 2%
  select(-total_location)

#Number of departure locations by gender
gender_migration <- partition_statistics %>%
  drop_na(migrated_from) %>%
  distinct(migrated_from, gender) %>%
  group_by(gender) %>%
  count(gender)

The first thing that is notable about the departure locations is their diversity. While Lahore was the most common point of departure for both women (30%) and men (22%), followed by Rawalpindi (12% and 8%), many people departed from quite different locations. In fact, women departed from 43 different locations, while men departed from 89.

We can observe this diversity of points of departure by looking at a spatial representation of the direct lines of travel to Delhi and noting the diversity of points of origin.

Figure 2: Departure locations during Partition

    Migrated From      Gender   Percent departure by Gender
1   Lahore             Female   30%
2   Lahore             Male     22%
3   Rawalpindi         Female   12%
4   Rawalpindi         Male     8%
5   Multan             Female   4%
6   Multan             Male     4%
7   Faisalabad         Female   6%
8   Faisalabad         Male     4%
9   Dera Ismail Khan   Female   6%
10  Dera Ismail Khan   Male     2%
Table 3: Top 5 Departure locations by gender
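
A note on how the percentages in these tables are rounded: the code passes 2 as the second argument of scales::percent(), which is its accuracy argument, so values are rounded to the nearest 2% rather than shown with two decimal places. That would be consistent with every percentage in the tables being an even number. The sketch below illustrates the difference with an arbitrary share; it assumes the scales package is installed.

```r
library(scales)

# An arbitrary share chosen for illustration
share <- 0.189

nearest_two <- percent(share, accuracy = 2)   # rounds to the nearest 2%
nearest_one <- percent(share, accuracy = 1)   # rounds to the nearest 1%

c(nearest_two, nearest_one)
```

Because each value is rounded independently, percentages produced this way need not sum to exactly 100%.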

Transit Locations

partition_transfer <- partition_statistics %>%
  #Filter out Delhi as a final location
  filter(resolved_location != "Delhi") %>%
  
  #Clean up the migrated from and migrated to data
  mutate(migrated_from = str_extract(migrated_from, "[^,]+")) %>%
  mutate(migrated_to = str_extract(migrated_to, "[^,]+")) %>%
  
  #Remove all cases where the migrated from location is the same as one of the transit locations
  filter(migrated_from != resolved_location) %>%
  
  #Remove all cases where the resolved location equals migrated to.
  filter(migrated_to != resolved_location) %>%
  
  #Get the number of transfer locations based on where people migrated from and their gender
  group_by(gender) %>%
  mutate(total_gender = n_distinct(PersonID)) %>%
  group_by(migrated_from, gender) %>%
  add_count(resolved_location, name = "migration_location", sort =
              TRUE) %>%
  #Calculate the percentage as a share of all respondents of the same gender
  mutate(percent_transit = migration_location / total_gender) %>%
  
  #Clean up table for presentation
  select(migrated_from,
         gender,
         resolved_location,
         migration_location,
         percent_transit) %>%
  distinct(migrated_from, resolved_location, percent_transit) %>%
  arrange(desc(percent_transit), resolved_location) %>%
  ungroup() %>%
  top_n(10, percent_transit) %>%
  mutate(percent_transit = percent(percent_transit, accuracy = 2)) %>% #accuracy = 2 rounds to the nearest 2%
  relocate(gender, .before = migrated_from)

Likewise, the transit locations were also quite diverse. Among those departing from Lahore, Amritsar occurs most frequently for both women (10%) and men (8%), but it does not account for a majority of transit locations.

#Generate table
partition_transfer %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>%
  htmlTable(header = c(
    "Gender",
    "Migrated From",
    "Transfer",
    "Percent of Respondents Transferred"
  ))
    Gender   Migrated From   Transfer          Percent of Respondents Transferred
1   Female   Lahore          Amritsar          10%
2   Male     Lahore          Amritsar          8%
3   Female   Rawalpindi      Lahore            6%
4   Female   Lahore          Rawalpindi        6%
5   Female   Lahore          Anarkali Bazaar   4%
6   Female   Lahore          Shimla            4%
7   Male     Lahore          Rawalpindi        4%
8   Female   Faisalabad      Amritsar          4%
9   Female   Rawalpindi      Karol Bagh        4%
10  Female   Lahore          Mumbai            4%
11  Female   Lahore          Mussoorie         4%

The spatial analysis requires several manipulations of the data that were done in QGIS. What follows is a brief outline.

Create from Locations

Note: the geocoding call is commented out for the purposes of this notebook, so the chunks that use hub_from_geo will only run once it has been uncommented.

  • Subset the data into from hubs.
hub_from <- partition_statistics %>% 
            select(migrated_from) %>% 
            drop_na() %>% 
            filter(migrated_from!="TBA") %>% 
            distinct()
  • Geocode each from hub.
#hub_from_geo <- geo(hub_from$migrated_from, method = 'osm', full_results = FALSE)
  • Attach data back to from_hubs.
hub_from_join <- hub_from_geo %>%
  rename(migrated_from = address)

hub_from_join <- partition_statistics %>%
  left_join(hub_from_join)

hub_from_join <- hub_from_join %>%
  select(name, age, migrated_from, gender, occupation, lat, long) %>%
  filter(migrated_from != "TBA") %>%
  drop_na(migrated_from) %>%
  distinct()
  • Write from_hub file for geoprocessing.
write_csv(hub_from_join,"data/from_hubs.csv")
  • OSM will not necessarily catch all locations. Some of these have to be hand coded.
hub_from_join_clean <- read_csv("data/from_hubs_clean.csv")
  • Create line geometry for departure locations to Delhi.
hub_from_join_clean <- hub_from_join_clean %>%
  #WKT lists longitude before latitude; the second point is the Delhi hub
  mutate(WKT = paste0("LINESTRING(", long, " ", lat, ", 77.2219388 28.6517178)"))
write_csv(hub_from_join_clean, "data/hubs_to_delhi.csv")
  • Measure distance from from_hub to Pakistan border using the NNjoin plugin for QGIS.
distance_to_border <- read_csv("data/distance_to_border.csv")
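
The line geometry step above can be sketched on its own as follows. The departure coordinates are invented approximations and hubs is a hypothetical table; only the Delhi coordinates are taken from the code above. The point of the example is the WKT layout: longitude precedes latitude, and the two points of the line are separated by a comma.

```r
# Invented departure points; the Delhi coordinates are the ones used above
hubs <- data.frame(
  migrated_from = c("Lahore", "Rawalpindi"),
  long = c(74.34, 73.05),
  lat  = c(31.55, 33.60)
)

delhi_long <- 77.2219388
delhi_lat  <- 28.6517178

# WKT puts x (longitude) before y (latitude); points are separated by commas
hubs$WKT <- sprintf("LINESTRING(%s %s, %s %s)",
                    hubs$long, hubs$lat, delhi_long, delhi_lat)
hubs$WKT
```

A CSV with such a WKT column can be loaded into QGIS as a delimited text layer with the geometry definition set to well-known text.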

Evaluating distance to border

#Get the mean distance by gender
group_mean <- distance_to_border %>%
  group_by(gender) %>%
  summarise(grp_mean = mean(distance_km),
            group_median = median(distance_km))

total_mean <- distance_to_border %>% 
  ungroup() %>% 
  summarise(grp_mean = mean(distance_km))

#Get the share of people who traveled more than 100km
more_than_100 <- distance_to_border %>%
  summarise(more_than = sum(distance_km > 100, na.rm = TRUE) / n())

The path of travel to the border was quite long for the majority of interviewees. Men and women both traveled more than 128km on average, and the median distance also exceeded 100km (women = 105km, men = 128km). Even though 100km is a rather arbitrary threshold, the majority of people (59%) traveled more than 100km to get to the border. The sense that most interviewees traveled from quite far to even reach the border is also borne out in the distribution of distances traveled.

distance_to_border %>%
  group_by(gender) %>%
  ggplot(aes(distance_km, fill = gender)) +
  geom_histogram(
    color = "black",
    alpha = .4,
    position = "identity"
  ) +
  scale_color_brewer(palette = "Pastel2") +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Histogram of Distance to Border by Gender",
       x = "Distance in km",
       y = "Count",
       fill = "Gender") +
  facet_wrap(~ gender) +
  theme_classic() +
  geom_vline(data = group_mean,
             aes(xintercept = grp_mean, color = gender),
             linetype = "dashed") +
  theme(legend.position = "none") +
  geom_text(data = group_mean,
            aes(
              x = grp_mean,
              y = 0,
              label = paste("Mean Distance = ", round(grp_mean, 0), "km"),
              hjust = -.05,
              vjust = -22
            ))

Figure 3: Distribution of Distance to Border

Analyzing Hub and Spokes Model

Using QGIS, it is possible to take all of the locations in each narrative and attach them to a central hub, in this case Delhi.

to_hubs <- partition_statistics %>%
  filter(resolved_location != "Delhi") %>%
  #WKT lists longitude before latitude; the second point is the Delhi hub
  mutate(WKT = paste0("LINESTRING(", longitude, " ", latitude, ", 77.2219388 28.6517178)")) %>%
  select(name, age, gender, occupation, resolved_location, migrated_from,
         migrated_to, PersonID, loc_by_name, loc_total, WKT)
write_csv(to_hubs, "data/hub_and_spoke.csv")

Visual inspection of the data reveals quite a diversity of locations mentioned. Notably, most of the locations outside of India are those in England and the US. At the regional level it makes sense that the majority of locations mentioned are related to Partition. The intensity of line intersections indicates that these locations predominated the narratives, which, given that they were the core of the interview, makes sense.

Figure 4: Locations Mentioned in the Interviews

At the more local level, it is interesting to note that the locations that are mentioned the most are those that were a transit point for refugees, such as the Kingsway Refugee Camp and the camp near Daryaganj, and those neighborhoods that were fundamentally shaped by Partition such as Karol Bagh and Lajpat Nagar. This tracks with the larger understanding of the impact of Partition on Delhi.

Figure 5: Locations Mentioned in Delhi-NCR

Spatial Statistics

  • Import spoke distances calculated in QGIS
spoke_distance <- read_csv("data/hub_spoke_distance.csv")
  • Calculate total spoke distance for each person.
distance_spokes_total <- spoke_distance %>%
  filter(migrated_from != "TBA") %>%
  select(name, age, gender, occupation, PersonID, distance_km) %>%
  group_by(name, age) %>%
  mutate(total = round(sum(distance_km))) %>%
  distinct(name, gender, occupation, total)
distance_spokes_total %>%
  group_by(gender) %>%
  ggplot(aes(total, fill = gender)) +
  geom_histogram(
    color = "black",
    alpha = .4,
    position = "identity"
  ) +
  scale_color_brewer(palette = "Pastel2") +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Distribution of Total Spoke Length",
       x = "Distance in km",
       y = "Count",
       fill = "Gender") +
  facet_wrap(~ occupation) +
  theme_classic() 

Figure 6: Distribution of Total Spoke Length
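
The spoke distances used here were calculated in QGIS. As a rough cross-check, a single spoke length can be approximated with the haversine formula. The function haversine_km below is a hypothetical helper, not part of the original workflow; it assumes a spherical Earth with a radius of 6371 km, so its results may differ slightly from ellipsoidal measurements in QGIS. The Lahore coordinates are approximate.

```r
# Great-circle (haversine) distance in km; inputs are decimal degrees
haversine_km <- function(long1, lat1, long2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat  <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Approximate coordinates for Lahore to the Delhi hub used in this document
lahore_to_delhi <- haversine_km(74.34, 31.55, 77.2219388, 28.6517178)
lahore_to_delhi
```

The function is vectorised, so it can be applied to whole longitude and latitude columns at once.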

spokes_mean <- spoke_distance %>%
  group_by(gender, occupation) %>%
  summarise(grp_mean = round(mean(distance_km),0))

spokes_mean %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>% 
  htmlTable(header = c("Gender", "Occupation", "Mean Total Distance"))
   Gender   Occupation      Mean Total Distance
1  Female   Not Mentioned   836
2  Female   Yes             1348
3  Male     Not Mentioned   617
4  Male     Yes             1040

Domestic and International Locations

Another way to think about differences in narrating the self is the number of “domestic” and “international” locations mentioned. To be sure, the notions of “domestic” and “international” are fluid in the context of Partition. After all, Pakistan and India were “domestic” locations before Partition. Perhaps a better way to think of it is in terms of locations that are inside or outside the South Asian subcontinent.
Using a spatial join in QGIS, all locations that fell within the subcontinent were labeled “Subcontinent” and all those outside were labeled “International.” As all the operative narratives were in Bangladesh, India, and Pakistan, these were clubbed together as “Subcontinent.”

spokes_distance_world <- read_csv("data/hub_spoke_distance_international.csv")

With all of the locations labeled as inside or outside the subcontinent, two separate tables were created. The first contains all the people who mentioned at least one location outside of the subcontinent, irrespective of whether they also mentioned a “domestic” location. Using the anti_join function, these people were filtered out of the overall table, which created the domestic table. The counts of the two tables were taken separately and then bound back together.

spokes_international <- spokes_distance_world %>%
  filter(CNTRY_NAME == "International") %>%
  distinct(PersonID, .keep_all = TRUE)

spokes_domestic <- spokes_distance_world %>%
  filter(CNTRY_NAME == "Subcontinent") %>%
  anti_join(spokes_international, by = "PersonID") %>%
  distinct(PersonID, .keep_all = TRUE) %>%
  group_by(gender, occupation, CNTRY_NAME) %>%
  count(name = "total")

spokes_international_count <- spokes_international %>%
  group_by(gender, occupation, CNTRY_NAME) %>%
  count(name = "total")

spokes_by_dom_int <- spokes_international_count %>%
  bind_rows(spokes_domestic)
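
The classification logic can be checked on a small invented example. The table toy_spokes and its IDs are assumptions for illustration only. Person P1 mentions both kinds of location and is therefore counted once, as international; P2 and P3 mention only subcontinent locations.

```r
library(dplyr)

# Toy stand-in for spokes_distance_world; IDs and labels are invented
toy_spokes <- tibble(
  PersonID   = c("P1", "P1", "P2", "P3", "P3"),
  CNTRY_NAME = c("Subcontinent", "International",
                 "Subcontinent", "Subcontinent", "Subcontinent")
)

# People with at least one location outside the subcontinent
toy_international <- toy_spokes %>%
  filter(CNTRY_NAME == "International") %>%
  distinct(PersonID, .keep_all = TRUE)

# Everyone else: drop anyone already counted as international
toy_domestic <- toy_spokes %>%
  filter(CNTRY_NAME == "Subcontinent") %>%
  anti_join(toy_international, by = "PersonID") %>%
  distinct(PersonID, .keep_all = TRUE)

toy_counts <- bind_rows(toy_international, toy_domestic) %>%
  count(CNTRY_NAME, name = "total")
toy_counts
```

Each person ends up in exactly one of the two groups, so the group totals add up to the number of people.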

The table of Subcontinent and International counts was then filtered by gender. The resulting tables were used to tabulate the percentage of people who mention only subcontinent locations or at least one international location, by occupation.

women_dom_int <- spokes_by_dom_int %>%
  filter(gender == "Female") %>%
  select(occupation, CNTRY_NAME, total) %>%
  ungroup() %>%
  #accuracy = 2 rounds each percentage to the nearest 2%
  mutate(percent = percent(total / sum(total), accuracy = 2)) %>%
  select(-gender) %>%
  arrange(occupation)
women_dom_int %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>%
  htmlTable(
    header = c("Occupation", "Region", "Total", "Percent"),
    caption = "Women who mention domestic and international locations by occupation")
Women who mention domestic and international locations by occupation
   Occupation      Region          Total   Percent
1  Not Mentioned   International   21      18%
2  Not Mentioned   Subcontinent    47      42%
3  Yes             International   20      18%
4  Yes             Subcontinent    23      20%
men_dom_int <- spokes_by_dom_int %>%
  filter(gender == "Male") %>%
  select(occupation, CNTRY_NAME, total) %>%
  ungroup() %>%
  #accuracy = 2 rounds each percentage to the nearest 2%
  mutate(percent = percent(total / sum(total), accuracy = 2)) %>%
  select(-gender) %>%
  arrange(occupation)

men_dom_int %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>%
  htmlTable(
    header = c("Occupation", "Region", "Total", "Percent"),
    caption = "Men who mention domestic and international locations by occupation")
Men who mention domestic and international locations by occupation
   Occupation      Region          Total   Percent
1  Not Mentioned   International   3       2%
2  Not Mentioned   Subcontinent    14      6%
3  Yes             International   81      38%
4  Yes             Subcontinent    111     54%

The results show that the percentage of women who do not mention an occupation but do mention an international location (18%) is considerably higher than the corresponding percentage of men (2%). Conversely, men who mention an international location tend to mention their occupation as well.